Performing exploratory data analysis for a maximum of three features, comparing the listings in Lisbon and Porto.
Development of a decision tree predicting the place availability for the next year (column “availability_365”). More specifically, predict whether the place will be available for at least 280 days out of the 365 days.
import pandas as pd
import numpy as np

# Visualization imports
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
df_lisbon = pd.read_csv('listings_lisbon.csv')
df_porto = pd.read_csv('listings_porto.csv')
# An outer merge on all shared columns stacks the two tables (fully identical rows would be collapsed)
df = pd.merge(df_lisbon, df_porto, how='outer')
df.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6499 | Belém 1 Bedroom Historical Apartment | 14455 | Bruno | Lisboa | Belém | 38.69750 | -9.19768 | Entire home/apt | 40 | 3 | 27 | 2021-01-26 | 0.34 | 1 | 341 |
| 1 | 25659 | Heart of Alfama - Lisbon Center | 107347 | Ellie | Lisboa | Santa Maria Maior | 38.71167 | -9.12696 | Entire home/apt | 30 | 10 | 113 | 2019-12-08 | 1.36 | 1 | 108 |
| 2 | 29248 | Apartamento Alfama com vista para o rio! | 125768 | Bárbara | Lisboa | Santa Maria Maior | 38.71272 | -9.12628 | Entire home/apt | 38 | 3 | 325 | 2021-01-10 | 2.64 | 1 | 303 |
| 3 | 29396 | Alfama Hill - Boutique apartment | 126415 | Mónica | Lisboa | Santa Maria Maior | 38.71156 | -9.12987 | Entire home/apt | 25 | 2 | 265 | 2021-01-22 | 2.49 | 2 | 323 |
| 4 | 29915 | Modern and Cool Apartment in Lisboa | 128890 | Sara | Lisboa | Avenidas Novas | 38.74712 | -9.15286 | Entire home/apt | 48 | 5 | 40 | 2021-01-24 | 0.31 | 1 | 294 |
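Since the task compares Lisbon and Porto, it can also help to keep a source label while stacking the two frames. A minimal sketch with toy stand-ins for the two listings tables (the `city` column name is my own choice, not from the original data):

```python
import pandas as pd

# Hypothetical stand-ins for the two city listings tables.
df_lisbon = pd.DataFrame({'id': [1, 2], 'price': [40, 30]})
df_porto = pd.DataFrame({'id': [3, 4], 'price': [25, 35]})

# Tag each row with its source city, then stack the frames.
df_lisbon['city'] = 'Lisbon'
df_porto['city'] = 'Porto'
df = pd.concat([df_lisbon, df_porto], ignore_index=True)
```

Unlike `merge(how='outer')`, `pd.concat` simply appends rows, so exact duplicate listings across the two files would be kept rather than collapsed.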
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31005 entries, 0 to 31004
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              31005 non-null  int64
 1   name                            30994 non-null  object
 2   host_id                         31005 non-null  int64
 3   host_name                       30995 non-null  object
 4   neighbourhood_group             31005 non-null  object
 5   neighbourhood                   31005 non-null  object
 6   latitude                        31005 non-null  float64
 7   longitude                       31005 non-null  float64
 8   room_type                       31005 non-null  object
 9   price                           31005 non-null  int64
 10  minimum_nights                  31005 non-null  int64
 11  number_of_reviews               31005 non-null  int64
 12  last_review                     25452 non-null  object
 13  reviews_per_month               25452 non-null  float64
 14  calculated_host_listings_count  31005 non-null  int64
 15  availability_365                31005 non-null  int64
dtypes: float64(3), int64(7), object(6)
memory usage: 4.0+ MB
# Check for missing values
df.isnull().sum()
id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       5553
reviews_per_month                 5553
calculated_host_listings_count       0
availability_365                     0
dtype: int64
For simplicity I will drop all NaN rows.
df.dropna(inplace=True)
print(f'After dropping missing values we end up with {len(df)} rows and {len(df.columns)} features')
After dropping missing values we end up with 25442 rows and 16 features
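Dropping is the simplest option; since only `last_review` and `reviews_per_month` contain NaNs, an alternative (not used here) would be to treat missing review activity as "never reviewed". A sketch with toy data:

```python
import pandas as pd
import numpy as np

# Toy frame: only reviews_per_month has missing values, as in the listings data.
df = pd.DataFrame({
    'price': [40, 30, 25],
    'reviews_per_month': [0.34, np.nan, 2.64],
})

# A listing with no reviews effectively has zero reviews per month.
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
```

This would keep the ~5.5k rows that dropping discards, at the cost of assuming NaN means "no reviews".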
# Create target variable
df.loc[df['availability_365'] < 280, 'availability_365'] = 0
df.loc[df['availability_365'] >= 280, 'availability_365'] = 1
df['availability_365'].unique()
array([1, 0], dtype=int64)
df.availability_365.value_counts()
0    13030
1    12412
Name: availability_365, dtype: int64
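The two `.loc` assignments above depend on the `< 280` rule running first (otherwise the new 0/1 values would be re-binarized); the same target can be built in a single vectorized step:

```python
import pandas as pd

# Toy availability values; the real column is df['availability_365'].
availability = pd.Series([341, 108, 303, 279, 280])

# True/False for the >= 280 rule, cast to the same 0/1 labels used above.
target = (availability >= 280).astype(int)
```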
print(f"""
The target variable `availability_365` distribution is:\n
{round(list(df.availability_365.value_counts())[0]/sum(list(df.availability_365.value_counts()))*100,3)}% for < 280 - (0)
{round(list(df.availability_365.value_counts())[1]/sum(list(df.availability_365.value_counts()))*100,3)}% for >= 280 - (1)
""")
The target variable `availability_365` distribution is:
51.215% for < 280 - (0)
48.785% for >= 280 - (1)
A recent last_review signals an active listing, and active listings are more likely to be available for a longer period of time or simply to be re-listed. Furthermore, the month of the last review may matter: reviews in a given month, perhaps the beginning of the year, could be associated with longer availability.
df['last_review'] = pd.to_datetime(df['last_review'], format="%Y-%m-%d")
df.last_review.describe(datetime_is_numeric=True)
count                            25442
mean     2020-03-20 03:54:36.251867392
min                2011-07-22 00:00:00
25%                2019-10-30 00:00:00
50%                2020-07-26 00:00:00
75%                2020-10-25 00:00:00
max                2021-02-14 00:00:00
Name: last_review, dtype: object
df.groupby('last_review')['availability_365'].sum().values
array([ 0, 1, 1, ..., 18, 14, 15], dtype=int64)
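The `year` and `month` features created below with `map(lambda ...)` can also be extracted with the vectorized `.dt` accessor, which is the idiomatic pandas route for datetime columns:

```python
import pandas as pd

# Toy datetime column; the real one is df['last_review'].
dates = pd.to_datetime(pd.Series(['2021-01-26', '2019-12-08']))

# .dt exposes vectorized datetime attributes.
year = dates.dt.year
month = dates.dt.month
```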
df['year'] = df['last_review'].map(lambda x: x.year)
aux_data1 = pd.DataFrame(df.groupby(['year','availability_365'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data1['availability_365'] = pd.Categorical(aux_data1.availability_365)
px.bar(aux_data1, x='year', y='count', color='availability_365', barmode='group', text='count',
title = f"""The more recent is the last review, the more likely (in proportion) we are to have availability <br>for > 280 days""")
The more recent the last review, the more likely the listing is to be available for at least 280 days. Conversely, listings with older reviews are more likely to have Smaller than 280 days availability.
df['month'] = df['last_review'].map(lambda x: x.month)
aux_data = pd.DataFrame(df.groupby(['month','availability_365'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data['availability_365'] = pd.Categorical(aux_data.availability_365)
px.bar(aux_data, x='month', y='count', color='availability_365', barmode='group', text='count',
title = f"""The month with the highest availability for >= 280 days is January, with 2247 entries""")
There seems to be a relation between certain months and the number of entries with at least 280 days available for the year, namely January, August, September and October.
There is also some apparent cyclicality throughout the year: January is the highest month, then a sharp decrease until April, followed by a slow increase until July that speeds up in the late-summer months (August, September and October), and then a decrease again into November and December.
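Month is a cyclical feature (December is adjacent to January), so if it were ever fed to a model, a sine/cosine transformation would preserve that adjacency. A sketch, not part of the original pipeline:

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])

# Map each month onto the unit circle, so month 12 sits next to month 1.
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)
```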
aux_data2 = pd.DataFrame(df.groupby(['room_type','availability_365'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data2['availability_365'] = pd.Categorical(aux_data2.availability_365)
px.bar(aux_data2, x='room_type', y='count', color='availability_365', barmode='group', text='count',
title = f"""Room type doesn't seem to significantly distinguish >=280 from <280 availability except for Private Room""")
By itself this feature does not split the data well: the two target classes are fairly balanced within each of the 4 categories. However, Entire home/apt is clearly the most frequent category, followed by Private room, which actually appears more often with Smaller than 280 days availability.
When combined with other features, this may hold some useful explanatory power.
df['minimum_nights_Percentile'] = pd.qcut(df["minimum_nights"].rank(method='first'), 10)
df['minimum_nights_Percentile'] = df['minimum_nights_Percentile'].astype('str')
aux_data3 = pd.DataFrame(df.groupby(['minimum_nights_Percentile','availability_365'])['id'].count()).reset_index().rename(columns={'id':'count'})
aux_data3['availability_365'] = pd.Categorical(aux_data3.availability_365)
px.bar(aux_data3, x='minimum_nights_Percentile', y='count', color='availability_365', barmode='group', text='count',
title = f"""A lower minimum number of nights seems to result in more entries with availability >=280""")
A lower minimum number of nights seems to result in more entries with availability >= 280, and this relation inverts as the minimum number of nights increases.
From bins 1 to 3 the two classes look balanced; from bins 4 to 7 there is a clear shift toward Smaller than 280 days availability. The relationship then inverts in bins 8 and 9, where at least 280 days availability becomes dominant, before the classes balance again in the 10th bin.
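The `rank(method='first')` trick above is what guarantees ten equal-sized bins even when `minimum_nights` has many tied values; `qcut` on the raw values would otherwise raise on duplicate bin edges. A toy illustration:

```python
import pandas as pd

# Heavily tied data: pd.qcut(s, 4) would fail with duplicate bin edges.
s = pd.Series([1, 1, 1, 1, 2, 2, 3, 10])

# Ranking first breaks ties, so every quantile bin gets the same number of rows.
bins = pd.qcut(s.rank(method='first'), 4)
sizes = bins.value_counts()
```

The trade-off is that tied values can land in different bins, so the bin edges refer to ranks, not to the raw `minimum_nights` values.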
room_type, neighbourhood and minimum_nights_Percentile are categorical features that need encoding. month would need a cyclical transformation; however, I will only be using the time feature year, which does not need any transformation.
# Check for unique values
features = ['minimum_nights_Percentile', 'number_of_reviews', 'neighbourhood','room_type', "month", "year"]
for i in features:
    size = len(list(df[i].unique()))
    print(f'- "{i}" has {size} unique values')
- "minimum_nights_Percentile" has 10 unique values
- "number_of_reviews" has 470 unique values
- "neighbourhood" has 264 unique values
- "room_type" has 4 unique values
- "month" has 12 unique values
- "year" has 11 unique values
# label encoding
from sklearn import preprocessing
label_encoding = ['neighbourhood', "room_type"]
encoder = []
for i in label_encoding:
    le = preprocessing.LabelEncoder()
    le.fit(df[i])
    encoder.append(le)
    df[i + '_encoded'] = le.transform(df[i])
# View the labels
list(encoder[1].classes_)
['Entire home/apt', 'Hotel room', 'Private room', 'Shared room']
list(encoder[1].inverse_transform([0]))
['Entire home/apt']
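Note that `LabelEncoder` is documented for encoding targets; for input features, `pd.factorize` (or `OrdinalEncoder`, used next) produces the same kind of integer codes while keeping the code-to-label mapping at hand. A sketch:

```python
import pandas as pd

room_type = pd.Series(['Entire home/apt', 'Private room',
                       'Entire home/apt', 'Shared room'])

# codes holds the integer encoding; uniques maps code -> original label.
codes, uniques = pd.factorize(room_type)
```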
from sklearn.preprocessing import OrdinalEncoder
categories = [['(0.999, 2545.1]', '(2545.1, 5089.2]', '(5089.2, 7633.3]', '(7633.3, 10177.4]', '(10177.4, 12721.5]', '(12721.5, 15265.6]', '(15265.6, 17809.7]', '(17809.7, 20353.8]', '(20353.8, 22897.9]', '(22897.9, 25442.0]']]
#Instantiate ordinal encoder
ordinal_encoder = OrdinalEncoder(categories=categories)
#Fit ordinal encoder
ordinal_encoder.fit(df[['minimum_nights_Percentile']])
# transform the data
df['minimum_nights_Percentile_encoded'] = ordinal_encoder.transform(df[['minimum_nights_Percentile']])
list(ordinal_encoder.categories_)
[array(['(0.999, 2545.1]', '(2545.1, 5089.2]', '(5089.2, 7633.3]',
'(7633.3, 10177.4]', '(10177.4, 12721.5]', '(12721.5, 15265.6]',
'(15265.6, 17809.7]', '(17809.7, 20353.8]', '(20353.8, 22897.9]',
'(22897.9, 25442.0]'], dtype=object)]
list(ordinal_encoder.inverse_transform(df[['minimum_nights_Percentile_encoded']]))[:5]
list(ordinal_encoder.inverse_transform([[6]]))
[array(['(15265.6, 17809.7]'], dtype=object)]
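Hardcoding the interval strings as `categories` is fragile: the edges change whenever the row count changes. `pd.qcut(..., labels=False)` yields the same 0-9 ordinal codes directly, skipping the string round-trip. A sketch:

```python
import pandas as pd

# Stand-in for the minimum_nights column (100 rows).
s = pd.Series(range(1, 101))

# labels=False returns the decile index (0-9) instead of interval labels.
decile = pd.qcut(s.rank(method='first'), 10, labels=False)
```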
The features used for the model are minimum_nights_Percentile_encoded, room_type_encoded and year.
# imports
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import plot_confusion_matrix  # deprecated; removed in scikit-learn 1.2
from sklearn.metrics import plot_roc_curve, roc_curve, RocCurveDisplay  # plot_roc_curve likewise deprecated
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, f1_score, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn import tree
# Split dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(df[['minimum_nights_Percentile_encoded','room_type_encoded','year']],
df['availability_365'], test_size=0.2, random_state=42)
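Since the classes here are nearly 50/50 this matters little, but passing `stratify` to `train_test_split` guarantees the 0/1 proportions are identical in the train and test sets. A sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

# stratify=y preserves the 50/50 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```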
params_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(low=1, high=10),
    'max_leaf_nodes': randint(low=2, high=20000),   # must be greater than 1
    'min_samples_leaf': randint(low=1, high=100),
    'min_samples_split': randint(low=2, high=100)   # must be at least 2
}
clf_tuned = DecisionTreeClassifier(random_state=42)
random_search = RandomizedSearchCV(clf_tuned, params_dist, cv=10, scoring='f1', random_state=42)
random_search.fit(X_train, y_train)
clf_best_tuned = random_search.best_estimator_
cv_results = pd.DataFrame(random_search.cv_results_)
random_search.best_estimator_
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=1900, min_samples_leaf=55,
min_samples_split=64, random_state=42)
cv_results.sort_values('mean_test_score', ascending=False).head(5)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_leaf_nodes | param_min_samples_leaf | param_min_samples_split | params | ... | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.005954 | 0.000294 | 0.002173 | 0.000374 | gini | 3 | 1900 | 55 | 64 | {'criterion': 'gini', 'max_depth': 3, 'max_lea... | ... | 0.396296 | 0.419176 | 0.508423 | 0.434885 | 0.391779 | 0.527806 | 0.446035 | 0.454272 | 0.047283 | 1 |
| 7 | 0.005869 | 0.001200 | 0.002392 | 0.000491 | gini | 3 | 3557 | 51 | 7 | {'criterion': 'gini', 'max_depth': 3, 'max_lea... | ... | 0.396296 | 0.419176 | 0.508423 | 0.434885 | 0.391779 | 0.527806 | 0.446035 | 0.454272 | 0.047283 | 1 |
| 5 | 0.006876 | 0.000775 | 0.002508 | 0.000498 | gini | 6 | 6397 | 89 | 49 | {'criterion': 'gini', 'max_depth': 6, 'max_lea... | ... | 0.418519 | 0.429440 | 0.427536 | 0.377531 | 0.379032 | 0.421594 | 0.426966 | 0.430068 | 0.033544 | 3 |
| 2 | 0.006899 | 0.000538 | 0.002401 | 0.000490 | gini | 8 | 11364 | 24 | 3 | {'criterion': 'gini', 'max_depth': 8, 'max_lea... | ... | 0.402270 | 0.431637 | 0.417855 | 0.371812 | 0.384461 | 0.442928 | 0.443618 | 0.428329 | 0.032141 | 4 |
| 8 | 0.007932 | 0.001191 | 0.002584 | 0.000489 | gini | 9 | 14503 | 18 | 4 | {'criterion': 'gini', 'max_depth': 9, 'max_lea... | ... | 0.394402 | 0.425479 | 0.422483 | 0.382275 | 0.386059 | 0.442500 | 0.481246 | 0.426255 | 0.029953 | 5 |
5 rows × 23 columns
plt.figure(figsize=(30,10))
plot_tree(clf_best_tuned, class_names=['Smaller Than 280','At least 280'],
feature_names=['minimum_nights_Percentile_num','room_type_num','year'], filled=True)
plt.show()
The binary tree structure has 15 nodes and a depth of 3:
The first split uses the feature minimum_nights_Percentile_num, which corresponds to the minimum required nights at the given listing, encapsulated into 10 equally sized intervals in hierarchical order.
This splits the ranked minimum nights in half, (0.999 to 15265.6] versus (15265.6 to 25442.0], and as we have seen above in the EDA, around bin number 5 there is a significant increase in (0) target instances relative to (1).
Then two nodes are created: node-1 splits on year, and as we've seen, the more recent the last_review the more likely the listing is to be available for at least 280 days. Node-2 splits on room type, which is balanced across its categories except for Private room, which appears more often with Smaller than 280 days availability.
Node-1 then divides into 2 nodes for the final split, using minimum_nights_Percentile_num <= 2.5, i.e. (0.999 to 7633.3], on the left and room_type <= 1.5, i.e. room_type in ['Entire home/apt', 'Hotel room'], on the right.
Node-2 divides using minimum_nights_Percentile_num <= 7.5, i.e. up to (17809.7, 20353.8], on the left and year <= 2018.5 on the right.
A higher minimum nights requirement, older last_reviews (given here by the year) and a room_type of Hotel room or Entire home/apt predict At least 280 days availability; the opposite is valid for Smaller than 280 days availability. For those same room_types, again, older last_reviews predict At least 280 days availability, while more recent last_reviews predict Smaller than 280 days availability.
print(' -------------Tuned Model-------------\n', classification_report(y_test, clf_best_tuned.predict(X_test)))
 -------------Tuned Model-------------
               precision    recall  f1-score   support

           0       0.54      0.69      0.60      2569
           1       0.55      0.39      0.46      2520

    accuracy                           0.54      5089
   macro avg       0.55      0.54      0.53      5089
weighted avg       0.55      0.54      0.53      5089
# plot_confusion_matrix was removed in scikit-learn 1.2; use the Display API instead
ConfusionMatrixDisplay.from_estimator(clf_best_tuned, X_test, y_test,
                                      display_labels=['Smaller Than 280', 'At least 280'])
plt.title('Tuned Model (f1-score)')
plt.show()
The Tuned Model has an overall balanced performance, minimizing Type I errors (False Positives) and achieving a high recall (0.69) for the Smaller Than 280 target. Given that this is a balanced dataset, where the target variable is roughly distributed 50/50 between 0 and 1, this high recall for the (0) target is also very important.
Regarding the model, we can say it is a "pessimistic" model, as it tends toward Type II errors (False Negatives).
It is also very interesting to notice that, while the hyper-parameter tuning allows for a max_depth of up to 10, the Tuned Model maximizes the f1-score at max_depth=3, using all 3 features.
Next steps would be to evaluate changes in performance when allowing for even larger trees, and to explore the impact of more features, namely neighbourhood and month, to check for possible cyclicality.
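The Type I / Type II terminology used above maps directly onto the confusion-matrix cells. A minimal sketch with made-up predictions (not the model's actual output):

```python
from sklearn.metrics import confusion_matrix

# Toy labels with 1 as the positive class.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp);
# FP is a Type I error, FN is a Type II error.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```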
n_nodes = clf_best_tuned.tree_.node_count
children_left = clf_best_tuned.tree_.children_left
children_right = clf_best_tuned.tree_.children_right
feature = clf_best_tuned.tree_.feature
threshold = clf_best_tuned.tree_.threshold
node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, 0)] # start with the root node id (0) and its depth (0)
while len(stack) > 0:
    # `pop` ensures each node is only visited once
    node_id, depth = stack.pop()
    node_depth[node_id] = depth

    # If the left and right child of a node are not the same, we have a split node
    is_split_node = children_left[node_id] != children_right[node_id]
    # If a split node, append left and right children and depth to `stack`
    # so we can loop through them
    if is_split_node:
        stack.append((children_left[node_id], depth + 1))
        stack.append((children_right[node_id], depth + 1))
    else:
        is_leaves[node_id] = True

print("The binary tree structure has {n} nodes and has "
      "the following tree structure:\n".format(n=n_nodes))
for i in range(n_nodes):
    if is_leaves[i]:
        print("{space}node={node} is a leaf node.".format(
            space=node_depth[i] * "\t", node=i))
    else:
        print("{space}node={node} is a split node: "
              "go to node {left} if '{feature}' <= {threshold} "
              "else to node {right}.".format(
                  space=node_depth[i] * "\t",
                  node=i,
                  left=children_left[i],
                  feature=X_train.columns[feature[i]],
                  threshold=threshold[i],
                  right=children_right[i]))
The binary tree structure has 15 nodes and has the following tree structure:

node=0 is a split node: go to node 1 if 'minimum_nights_Percentile_encoded' <= 5.5 else to node 2.
	node=1 is a split node: go to node 3 if 'year' <= 2020.5 else to node 4.
	node=2 is a split node: go to node 5 if 'room_type_encoded' <= 1.5 else to node 6.
		node=3 is a split node: go to node 11 if 'minimum_nights_Percentile_encoded' <= 2.5 else to node 12.
		node=4 is a split node: go to node 13 if 'room_type_encoded' <= 1.5 else to node 14.
		node=5 is a split node: go to node 9 if 'minimum_nights_Percentile_encoded' <= 7.5 else to node 10.
		node=6 is a split node: go to node 7 if 'year' <= 2018.5 else to node 8.
			node=7 is a leaf node.
			node=8 is a leaf node.
			node=9 is a leaf node.
			node=10 is a leaf node.
			node=11 is a leaf node.
			node=12 is a leaf node.
			node=13 is a leaf node.
			node=14 is a leaf node.